Master's in Computational Linguistics
Course: Introduction to Computational Linguistics
Lecture 10: Question-answering and how to evaluate an NLP system
Lectures: Dan Cristea
Seminars & project: Mihaela Onofrei, Dan Cristea

Question-answering systems

Question Answering - Definition
• Question Answering (QA): a QA system takes as input a question in natural language and produces one or more ranked answers from a collection of documents
  – By providing a small set of exact answers to questions, QA takes a step closer to information retrieval rather than to document retrieval

Information Retrieval (IR) vs Document Retrieval (DR)
• IR: finding information, usually matching some user-defined fixed structure, in a collection of unstructured nature (usually texts)
• DR: finding a set of relevant documents, satisfying some user-defined criteria, in a collection of unstructured nature (usually texts)

Modules
• In a classical approach, QA systems normally adhere to a pipeline architecture composed of three main modules:
  – question analysis: the results are keywords, answer type, question type and focus
  – paragraph retrieval: the results are a set of relevant candidate paragraphs/sentences from the document collection
  – answer extraction: the results are a set of candidate answers ranked using likelihood measures

Question Type
  – Factoid: “Who discovered oxygen?”, “When did Hawaii become a state?” or “What football team won the World Cup in 1992?”
  – List: “What countries export oil?” or “What are the regions preferred by the Americans for holidays?”
  – Definition: “What is a quasar?” or “What is a question-answering system?”
  – Other: how-questions, why-questions, hypothetical questions, polar (Yes/No) questions, cross-lingual questions
Harabagiu, S. & Moldovan, D. (2003). Question Answering. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics.

Answer Type
• Person: “What”, “Who”, “Whom”, “With who”
• Location (City, Country, and Region): “What state/city”, “From where”, “Where”
• Organization: “Who produced”, “Who made”
• Temporal (Date and Year): “When”
• Measure (Length, Surface and Other): “How much”
• Count: “How many”
• Yes/No: “Did the girl fear that?”, “Is the car blue?”

The search collection
• Closed-domain
  – looks for answers in local collections, internal organization documents, newspapers, etc.
  – deals with questions from a specific domain (medical, baseball, etc.)
  – can exploit domain-specific knowledge (ontologies, rules, disambiguation)
• Open-domain
  – looks for answers on the internet
  – general questions about anything
  – can use general knowledge about the world

History of QA systems
• The first systems, created in the 60s
  – BASEBALL (Green et al., 1963): answers questions about baseball games
• Systems of the 70s
  – LUNAR (Woods & Kaplan, 1977): geological analysis of rocks returned by the Apollo moon missions
• Systems of the 80s
  – IURES (Cristea et al., 1985): medical domain, querying national programs library ← semantic network driven
  – QUERNAL (Cristea, 1987): personnel database, drilling and extraction, metallurgy, geography ← rules based (syntagma)

QA systems of today
• Powerset: http://www.powerset.com/ (http://www.bing.com/)
• Assimov, the chatbot: http://talkingrobot.org/
• IBM Watson: https://www.ibm.com/watson
• Djingo, the multi-service virtual assistant from Orange (controlled by voice or text): https://www.orange.com/en/Human-Inside/Shaping-tomorrow-s-world/SH2017/Djingo-your-multi-service-virtual-assistant
  – allows the user to navigate Orange TV, manage the connected home, make a call or access other services
• Amazon Alexa…
• Google play devices…

QA Competitions
• TREC (Text REtrieval Conference): started in 1992, http://trec.nist.gov/
  – National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA
• CLEF (Cross Language Evaluation Forum): started in 2000, http://www.clef-campaign.org/
  – European languages in both monolingual and cross-language contexts
  – Coordination: Istituto di Scienza e Tecnologie dell'Informazione, Pisa, Italy

An excerpt from a 'gold standard' file
[Figure: an excerpt from a 'gold standard' file]

UAIC System components
[Diagram; labels recoverable from the slide:]
• Background knowledge
• Test data (documents, questions, possible answers)
• Questions processing: lemmatization, stop words elimination, NEs identification, Lucene query
• Answers processing: lemmatization, stop words elimination, NEs identification, Lucene query
• Documents; Identify relevant documents; Lucene indexes 2
• Partial and global scores per answer

Background knowledge indexing
• The Romanian background knowledge has 161,279 documents in text format
  – 25,033 correspond to the AIDS topic
  – 51,130 to the Climate Change topic
  – 85,116 to the Music and Society topic
• The indexing component processes the name of the file and the text of the article => Lucene index 1

Test data processing – Processing questions
• Stop words elimination
• Lemmatization
• Named Entity identification
• Lucene query building

Test data processing – Processing possible answers
• Ontologies used (Iftene and Balahur, 2008), for instance, to locate cities (relation [is located in])
  – In which European cities has Annie Lennox performed?
  – Answers containing non-European cities are eliminated (here: replaced with the value XXXXX)

Searching using relevant documents for questions
• Then, in every index, the queries associated with the possible answers are run using Lucene
• For every answer, a list of documents with their Lucene relevance scores is obtained
  – Score2(d, a) is the relevance score for document d when searched with the Lucene query associated with answer a

References
• BASEBALL: Green, B. F. Jr. et al. (1963). Baseball: An Automatic Question Answerer. In Edward A. Feigenbaum and Julian Feldman (eds.), Computers and Thought, New York, McGraw-Hill Book Company, pp. 207-216.
• LUNAR: Woods, W. A., Kaplan, R. (1977). Lunar rocks in natural English: Explorations in natural language question answering. In Linguistic Structures Processing.
• IURES: Cristea, D., Tufiș, D., Mihăescu, T. (1985). IURES: A Computer Natural Language Question-Answering System with Possible Medical Applications. In Rev. Med. Chir. Soc. Med. Nat. Iași, LXXXIX(3), pp. 511-516.
• QUERNAL: Cristea, D. (1987). Sistemul QUERNAL. In Giumale, C., Preoţescu, D., Şerbănaţi, L. D., Tufiş, D., Cristea, D.: LISP, pp. 215-229, Editura Tehnică, Bucureşti.
• QA-UAIC@CLEF: Iftene, A. & Balahur, A. (2008). Answer validation on English and Romanian languages. In Workshop of the Cross-Language Evaluation Forum for European Languages, pp. 448-451.

How to evaluate an NLP system?
NLP Systems need evaluation
• “An important recent development in NLP has been the use of much more rigorous standards for the evaluation of NLP systems” (Manning and Schütze)
• To be published, all research must:
  ◦ establish a baseline, and
  ◦ quantitatively show that it improves on the baseline and the state of the art

NLP Systems – ways of doing evaluation
• Answer the question: “How well does the system work?”
• Possible domains for evaluation
  – Processing time of the system
  – Space usage of the system
  – Human satisfaction
  – Correctness of results
• Measures: (Accuracy, Error), (Precision, Recall, F-measure)

Accuracy and Error
• To evaluate means to compare the system output against a gold standard
• The results of a system are marked as:
  – Correct: matches the gold standard
  – Incorrect: otherwise

Accuracy and Error - Example
• A system that detects the language
• Accuracy = 66.66 %
• Error = 33.33 %

Precision and Recall
• Precision and Recall are set-based measures
• They evaluate the quality of some set membership, based on a reference set membership
• Precision: what proportion of the retrieved documents is relevant?
• Recall: what proportion of the relevant documents is retrieved?

F-measure (F-score or F1-score)
• The F-measure is a measure of a test's accuracy that considers both the precision p and the recall r
• General formula: F_β = (1 + β²) * p * r / (β² * p + r)
• F1-measure (β = 1): F_1 = 2 * p * r / (p + r)

How to calibrate a module?
• Suppose we want to build a module to achieve a particular goal. Then, in fact, we will have to build 3 modules:
  – The Training Module (TM)
  – The module itself (X)
  – The Evaluation Module (EM)

Evaluation measures
• Precision = common items in Test & Gold / items in Test
• Recall = common items in Test & Gold / items in Gold
• F-measure = 2 * P * R / (P + R)
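
The measures above can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation module used in the course: the function names `accuracy_and_error` and `precision_recall_f` are hypothetical, and the toy data (language labels, document ids) is invented for the example. It assumes accuracy is computed item by item against the gold standard, and that precision and recall are computed on sets, as in the definitions above.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def accuracy_and_error(system: Sequence[Hashable],
                       gold: Sequence[Hashable]) -> Tuple[float, float]:
    """Compare system output with the gold standard, item by item."""
    if not gold or len(system) != len(gold):
        raise ValueError("system and gold must be non-empty and of equal length")
    # An item is Correct when it matches the gold standard, Incorrect otherwise.
    correct = sum(1 for s, g in zip(system, gold) if s == g)
    accuracy = correct / len(gold)
    return accuracy, 1.0 - accuracy


def precision_recall_f(test: Iterable[Hashable],
                       gold: Iterable[Hashable],
                       beta: float = 1.0) -> Tuple[float, float, float]:
    """Set-based precision, recall and F-measure (F1 when beta == 1)."""
    test_set, gold_set = set(test), set(gold)
    common = len(test_set & gold_set)  # common items in Test & Gold
    precision = common / len(test_set) if test_set else 0.0
    recall = common / len(gold_set) if gold_set else 0.0
    denominator = beta ** 2 * precision + recall
    # By convention (an assumption here), F is 0 when P and R are both 0.
    f = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f


if __name__ == "__main__":
    # Invented toy data: a language detector that gets 2 of 3 texts right.
    acc, err = accuracy_and_error(["ro", "en", "it"], ["ro", "en", "fr"])
    print(f"Accuracy = {acc:.2%}, Error = {err:.2%}")

    # Invented toy data: 4 retrieved documents, 3 relevant ones, 2 in common.
    p, r, f1 = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d5"})
    print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f1:.3f}")
```

With beta = 1 the last function reduces to F-measure = 2 * P * R / (P + R); a beta above 1 weights recall more heavily, and a beta below 1 weights precision more heavily.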